Matteo Ceccarello
University of Padova
Joint work with Johann Gamper (U. Bolzano)
Fix a subsequence length \(w\)
Consider all pairs of (non-overlapping) subsequences of length \(w\) of a given time series. The first \(k\) by increasing distance are the top-\(k\) motifs.
Still, it’s a \(\Omega(n^2)\) algorithm
Finds motifs out of 1 billion long time series in one day, using 5 GPUs
To address these challenges, we will use an approach based on Locality Sensitive Hashing (LSH)
We consider the z-normalized Euclidean distance \[
d(x, y) = \sqrt{\sum_{i\in[1,w]} \left(
\frac{x_i - \bar{x}}{\sigma_x}
-
\frac{y_i - \bar{y}}{\sigma_y}
\right)^2
}
\] In this example we have a window length \(w = 640\),
hence we have vectors in \(\mathbb{R}^{640}\)
Locality Sensitive Hash function:
\[ h(\vec{x}) = \left\lfloor\frac{\vec{a} \cdot \vec{x} + b}{r}\right\rfloor \]
The key point is that we only compute the distance of subsequences falling
into the same bucket.
To lower the collision probability we concatenate \(k\) hash functions \[ \hat{h}(\vec{x}) = \langle h_1(\vec{x}), \dots, h_k(\vec{x}) \rangle \] this makes for a better precision of the index.
And to increase the recall of the index we repeat \(L\) times.
Build a priority queue to hold candidate motifs
Fix \(K\)
Repeat:
Sample \(K\) hash functions, hash all the points, and compute the distances of pairs in the same bucket
Let \(d_k\) be the distance of the \(k\)-th motif
If the probability of having seen all pairs
at distance \(\ge d_k\) is at least \(1-\delta\),
then return.
In each iteration we compute the distance of all subsequences in the same bucket.
In the first iteration, with \(k=4\), we discover a pair at distance 2
.
After 10 repetitions, we find a pair at distance 1
.
After 100 repetitions we did not find a better pair,
and the success probability is about 0.85
We then consider shorter prefixes of the hashes,
going through the 100 repetitions again.
After 15 repetitions, we observe that the
success probability is above our target, and thus return.
Theorem 1 The LSH index construction takes time \[ O(K_{max} \cdot \sqrt{L_{max}} n\log n) \]
Theorem 2 Let \(d(m_k)\) be the distance of the \(k\)-th motif, and \(i'\le K\), \(j' \le L\) be the parameters used by the optimal LSH algorithm. Then, the algorithm considers \[ O\left( j'\sum_{m\in T^w\times T^w} p(d(m))^{i'} + (L-j')\sum_{m\in T^w\times T^w} p(d(m))^{i'+1} \right) \] pairs in expectation.
Find the top-10 motifs.
| dataset | \(n\) (millions) | RC |
|---|---|---|
| astro | 1 | 8.63 |
| GAP | 2 | 9.17 |
| freezer | 7 | 7.95 |
| ECG | 7 | 109.06 |
| HumanY | 26 | 581.03 |
| Whales | 308 | 21.66 |
| Seismic | 1000 | 274.44 |
Relative Contrast measures difficulty: higher is easier. \[ RC = \frac{d_{avg}}{d_{m_k}} \]
Synthetic data, planted motifs with different relative contrasts. SCAMP-gpu only has one line since it is data-independent.
For shorter time series, or if the relative contrast is very small, use the Matrix Profile.
For time series of a few million values and above, with a not-to-small relative contrast, try Attimo
https://github.com/Cecca/attimo
matteo.ceccarello@unipd.it
Running for top-10 motifs, for different number of repetitions.
Data from LIGO:
Attimo: 6 hoursSCAMP: \(\approx 1\) secondfreezer and the 7-th motif7M points, window 5000